New College of Florida - Stat Inference II

2018 PGA Tour Season Analysis

Amanda Bucklin
Nate Wagner


Summary

We analyze variables associated with money earned, average score, and wins on the PGA Tour during the 2018 season to uncover the relationships among them, their effect sizes, and their statistical significance.


Variable Descriptions

  • Player Name
  • Rounds
  • Fairway Percentage
  • Avg Distance
  • GIR
    • The percent of time a player was able to hit the green in regulation (greens hit in regulation/holes played).
  • Average Putts
  • Average Scrambling
    • The percent of time a player misses the green in regulation, but still makes par or better.
  • Average Score
  • Points
    • The cumulative points for the year that the player has earned in the regular season of the FedExCup points race.
  • Wins
  • Top 10
  • Average SG Putts
    • The number of putts a player takes from a specific distance is measured against a statistical baseline to determine the player’s strokes gained or lost on a hole.
  • Average SG Total
    • The per round average of the number of strokes the player was better or worse than the field average on the same course & event.
  • SG:OTT
    • The per round average of the number of strokes the player was better or worse than the field average on the same course & event minus the Players Strokes Gained putting value.
  • SG:APR
    • The sum of the values for all holes played in a round minus the field average strokes gained/lost for the round is the player’s strokes gained/lost for that round. The sum of strokes gained over all rounds is divided by total rounds played.
  • SG:ARG
    • The number of Around the Green strokes a player takes from specific locations and distances are measured against a statistical baseline to determine the player’s strokes gained or lost on a hole.
  • Driving Distance Dummy
    • 3 Levels
      • Long: average driving distance greater than 302.0 yards
      • Average: average driving distance from 290.0 to 302.0 yards
      • Short: average driving distance less than 290.0 yards
  • Won That Year Dummy
    • Dummy variable indicating whether a player won that year
      • 1: Player won a PGA Tour event
      • 0: Player did not win a PGA Tour event


Exploratory Data Analysis


Response Variable

The distribution of money earned is skewed to the right, so we apply a log transform to that variable.






Explanatory Variables


Do some players make more money than others just because they play more?


Overall, there doesn’t seem to be much of a relationship between rounds played and money earned. However, if we ignore the players who won at least one tournament and look only at the red points, we do see a positive relationship.


Players who win simply make more money on average. The mean earnings of PGA Tour players who won at least once were 3,979,965 dollars, compared with 1,285,266 dollars for those who didn’t win.


The relationship between average score and money is clearly non-linear and negative. This makes sense: the higher a player’s average score, the worse the tournament performances, and so the less money earned.


Do players who hit the ball further off the tee have an advantage?

As driving distance increases, there is an increase in money made. But are players making more money because they hit the ball further, or because the advantage off the tee lets them shoot lower scores?


Here we do see a negative relationship between average driving distance and average score, suggesting that hitting the ball further may lead to lower scores.


How does short game affect money made?

Average Scrambling is the percent of time a player misses the green in regulation but still makes par or better, so this variable essentially represents a player’s short game. Here we see that an increase in average scrambling doesn’t have a big effect on money made.


However, average scrambling does have a big effect on average score, which in turn influences how much money a player makes. What is really going on is that most of the variables have a strong relationship with average score, and through it they influence how much money a player makes. So we decided to look at the variables that break down a golfer’s game (driving, short game, putting, etc.) and to use the player’s average score as the response.


The distribution of average score is approximately normal with a mean of 70.9 and standard deviation of 0.77.



Here are the remaining variables and their relationship with average scores.



Relationship Between Categorical Variables

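Proportion of players who won at least one tournament, by driving distance category (each row sums to 1):

            No     Yes
short       0.90   0.10
average     0.84   0.16
long        0.71   0.29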


Chi-Squared Test

\[ H_O: \text{Winning at least one tournament is independent of driving distance} \\ vs \\ H_A: \text{Winning at least one tournament depends on driving distance} \]


\[ \chi^2 = \sum\frac{(\mathrm{observed} - \mathrm{expected})^2}{\mathrm{expected}} \]


\[ \chi^2 \; | \; H_O \sim \chi^2_{df = 2} \]


\[ \chi^2 \; = 6.438 \]


\[ \text{Because the p value (0.040) is below 0.05, we reject the null hypothesis and conclude that} \\ \text{winning at least one tournament depends on driving distance} \]


## 
##  Pearson's Chi-squared test
## 
## data:  golf_df$won_that_year and golf_df$driving_dist_cat
## X-squared = 6.438, df = 2, p-value = 0.03999
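
As a quick check on the reported p-value, here is a minimal Python sketch (the report itself was produced in R). It relies on the fact that a chi-squared distribution with 2 degrees of freedom has the closed-form upper tail exp(-x/2). The function name `chi2_sf_df2` is made up for this illustration.

```python
import math


def chi2_sf_df2(x: float) -> float:
    """Upper-tail probability P(X > x) for a chi-squared variable with df = 2.

    With 2 degrees of freedom the chi-squared distribution is an
    exponential distribution with mean 2, so the tail is exp(-x / 2).
    """
    return math.exp(-x / 2.0)


chi_sq = 6.438                 # test statistic from the R output above
n_rows, n_cols = 2, 3          # won (No/Yes) x driving distance (short/average/long)
df = (n_rows - 1) * (n_cols - 1)

p_value = chi2_sf_df2(chi_sq)
print(f"df = {df}, p-value = {p_value:.5f}")
```

The degrees of freedom follow from the table dimensions, (2 - 1) × (3 - 1) = 2, which matches the R output.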



Linear Regression Models



Model With All Variables:

\[ \mathrm{Average.Score_i} = \beta_0 + \beta_1 \mathrm{Rounds_i} + \beta_2 \mathrm{FairwayPerc_i} + \beta_3 \mathrm{GIR_i} + \beta_4 \mathrm{Avg.Putts_i} + \beta_5 \mathrm{Avg.Scrambling_i} + \beta_6 \mathrm{Avg.SG.Putts_i} \\ + \beta_7 \mathrm{SG.OTT_i} + \beta_8 \mathrm{SG.APR_i} + \beta_9 \mathrm{SG.ARG_i} + \beta_{10} \mathrm{Dist.Avg_i} + \beta_{11} \mathrm{Dist.Long_i} + \beta_{12} \mathrm{Won_i} + e_i \\ \hspace{1cm} {e}_i \sim i.i.d. \hspace{1mm} \mathcal{N} (0,\,\sigma^{2})\,. \]


Variance Inflation Factors

##                        GVIF Df GVIF^(1/(2*Df))
## Rounds             1.177485  1        1.085120
## Fairway.Percentage 2.550392  1        1.596995
## gir                5.513885  1        2.348166
## Average.Putts      5.473769  1        2.339609
## Average.Scrambling 2.948572  1        1.717141
## Average.SG.Putts   2.454406  1        1.566654
## SG.OTT             3.998934  1        1.999733
## SG.APR             2.122350  1        1.456829
## SG.ARG             1.977045  1        1.406074
## driving_dist_cat   3.482365  2        1.366056
## won_that_year      1.189684  1        1.090726

Greens in regulation and average putts both show a high variance inflation factor (above 5). After removing gir, the VIF on average putts fell below the threshold of 5. We then performed backward stepwise selection and ended up with the following model.
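
The screening step can be sketched in Python using only the GVIF values printed above. The threshold of 5 is the one named in the text. The dictionary layout and the helper name `scaled_gvif` are assumptions made for this illustration, not part of the original analysis.

```python
# GVIF and degrees of freedom copied from the table above.
gvif_table = {
    "Rounds": (1.177485, 1),
    "Fairway.Percentage": (2.550392, 1),
    "gir": (5.513885, 1),
    "Average.Putts": (5.473769, 1),
    "Average.Scrambling": (2.948572, 1),
    "Average.SG.Putts": (2.454406, 1),
    "SG.OTT": (3.998934, 1),
    "SG.APR": (2.122350, 1),
    "SG.ARG": (1.977045, 1),
    "driving_dist_cat": (3.482365, 2),
    "won_that_year": (1.189684, 1),
}

VIF_THRESHOLD = 5.0  # cutoff used in the text


def scaled_gvif(gvif: float, df: int) -> float:
    """Return GVIF^(1/(2*Df)), the last column of the table."""
    return gvif ** (1.0 / (2 * df))


# Recompute the scaled column and flag terms whose GVIF exceeds the cutoff.
scaled = {name: scaled_gvif(g, d) for name, (g, d) in gvif_table.items()}
flagged = [name for name, (g, _) in gvif_table.items() if g > VIF_THRESHOLD]

for name, value in scaled.items():
    print(f"{name:20s} {value:.6f}")
print("Above threshold:", flagged)
```

For terms with one degree of freedom the scaled value is just the square root of the GVIF. For the three-level driving distance factor (Df = 2) it is the fourth root.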



Model From Backwards Selection Via AIC Criterion:

\[ \mathrm{Average.Score_i} = \beta_0 + \beta_1 \mathrm{Rounds_i} + \beta_5 \mathrm{Avg.Scrambling_i} + \beta_6 \mathrm{Avg.SG.Putts_i} \\ + \beta_7 \mathrm{SG.OTT_i} + \beta_8 \mathrm{SG.APR_i} + \beta_9 \mathrm{SG.ARG_i} + \beta_{12} \mathrm{Won_i} + e_i \\ \hspace{1cm} {e}_i \sim i.i.d. \hspace{1mm} \mathcal{N} (0,\,\sigma^{2})\,. \]

## 
## Call:
## lm(formula = Average.Score ~ Rounds + Average.Scrambling + Average.SG.Putts + 
##     SG.OTT + SG.APR + SG.ARG + won_that_year, data = golf_df_slim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59703 -0.12928  0.00706  0.12552  0.62939 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        72.298783   0.385017 187.781  < 2e-16 ***
## Rounds             -0.003058   0.001120  -2.730 0.006956 ** 
## Average.Scrambling -0.016681   0.006496  -2.568 0.011021 *  
## Average.SG.Putts   -0.944199   0.058233 -16.214  < 2e-16 ***
## SG.OTT             -0.981893   0.044361 -22.134  < 2e-16 ***
## SG.APR             -0.987245   0.048714 -20.266  < 2e-16 ***
## SG.ARG             -0.824935   0.091928  -8.974 3.29e-16 ***
## won_that_yearYes   -0.142803   0.041725  -3.423 0.000765 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2075 on 184 degrees of freedom
## Multiple R-squared:  0.9306, Adjusted R-squared:  0.9279 
## F-statistic: 352.2 on 7 and 184 DF,  p-value: < 2.2e-16

This model includes a variable for each part of a golfer’s game (driving, approaching, chipping, and putting) while also controlling for rounds played and whether a player won at least one tournament.



\[ \textbf{Quality of fit: } \text{Our regression model, including 7 golf characteristics, accounts for 93% of the variability in average scores} \\ \text{and the associated residual standard error is 0.208, meaning our predictions typically miss the observed average score by about 0.208 strokes} \]
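
The fit statistics can be cross-checked from numbers printed elsewhere in this report. The Python sketch below assumes n = 192 players (184 residual degrees of freedom plus 8 estimated coefficients) and takes the residual sum of squares, 7.9220, from the analysis of variance table in the interaction section. Small discrepancies in the last digit come from working with rounded inputs.

```python
import math

r_squared = 0.9306     # Multiple R-squared from the summary output
n_predictors = 7       # slopes in the selected model
resid_df = 184         # residual degrees of freedom
n_obs = resid_df + n_predictors + 1   # 192 players (derived, not stated directly)
rss = 7.9220           # residual sum of squares for this model (ANOVA table, Model 1)

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / resid_df

# Residual standard error is the square root of RSS over residual df.
rse = math.sqrt(rss / resid_df)

print(f"n = {n_obs}")
print(f"adjusted R-squared ~ {adj_r_squared:.4f}")
print(f"residual standard error ~ {rse:.4f}")
```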

Slope Interpretations:
\[ \textbf{ Average SG Putts: } \text {Per stroke increase in average strokes gained putting, holding rounds, average scrambling,} \\ \text{strokes gained off the tee, strokes gained approaching the green, strokes gained around the green, and winning at} \\ \text{ least one tournament in 2018 constant, on average we expect a 0.944 decrease in average score.} \]

\[ \textbf{ SG OTT: } \text {Per stroke increase in strokes gained off the tee, holding rounds, average scrambling,} \\ \text{average strokes gained putting, strokes gained approaching the green, strokes gained around the green, and winning at } \\ \text{least one tournament in 2018 constant, on average we expect a 0.982 decrease in average score.} \]

\[ \textbf{ SG APR: } \text {Per stroke increase in strokes gained approaching the green, holding rounds, average scrambling,} \\ \text{average strokes gained putting, strokes gained off the tee, strokes gained around the green, and winning at} \\ \text{ least one PGA tournament in 2018 constant, on average we expect a 0.987 decrease in average score. } \]



Testing for Model Significance:
\[ \mathrm{H_O:} \; \beta_1 = \beta_2 = ... = \beta_7 = 0 \\ vs \\ \mathrm{H_A:} \; \mathrm{at \; least \; one \;} \beta_j \neq 0 \]

\[ \mathrm{FS} = \frac{\mathrm{RegSS} / p}{\mathrm{RSS}/(n-(p+1))} \; | \: \mathrm{H_O} \sim F_{p, \; n-(p+1)} \]

\[ \mathrm{FS} = 352.2\\ \]

\[ \mathrm{p \;value < 2.2e-16} \]

\[ \text{Based on the tiny p value, we reject the null hypothesis and conclude} \\ \text{the model is statistically significant} \]
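
The overall F statistic can also be recovered from R-squared, because RegSS/RSS = R²/(1 - R²). The Python sketch below uses the rounded R-squared from the summary output, so it lands near, not exactly on, the reported 352.2.

```python
r_squared = 0.9306   # Multiple R-squared (rounded) from the summary output
p = 7                # number of slopes tested
resid_df = 184       # n - (p + 1)

# F = (RegSS / p) / (RSS / (n - (p + 1))), rewritten in terms of R-squared.
f_stat = (r_squared / p) / ((1 - r_squared) / resid_df)

print(f"F on {p} and {resid_df} DF ~ {f_stat:.1f}")
```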



Confidence Interval Interpretations:
\[ \textbf{ Average SG Putting: } \text{We are 95% confident that per stroke increase in average strokes gained putting,} \\ \text{holding rounds, average scrambling, strokes gained off the tee, strokes gained approaching the green, strokes gained around the green, and winning at} \\ \text{ least one tournament in 2018 constant, we expect average score to decrease by between 0.83 and 1.059 strokes, on average.} \]

\[ \textbf{ SG OTT: } \text{We are 95% confident that per stroke increase in strokes gained off the tee,} \\ \text{holding rounds, average scrambling, average strokes gained putting, strokes gained approaching the green, strokes gained around the green, and winning at } \\ \text{least one tournament in 2018 constant, we expect the average score to decrease by between 0.89 and 1.069 strokes, on average.} \]

\[ \textbf{ SG APR: } \text{We are 95 percent confident that per stroke increase in strokes gained approaching the green,} \\ \text{holding rounds, average scrambling, average strokes gained putting, strokes gained off the tee, strokes gained around the green, and winning at} \\ \text{ least one PGA tournament in 2018 constant, we expect the average score to decrease by between 0.89 and 1.08 strokes, on average.} \]
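
These intervals follow the usual estimate ± t × standard error recipe. The Python sketch below recomputes them from the coefficient table. The critical value 1.973 is hard-coded here as an approximation to the 97.5th percentile of a t distribution with 184 degrees of freedom, since the Python standard library has no t quantile function. The names `T_CRIT` and `conf_int` are made up for this illustration.

```python
# Approximate 97.5th percentile of t with 184 df (assumed value, see lead-in).
T_CRIT = 1.973

# (estimate, standard error) copied from the regression summary above.
coefficients = {
    "Average.SG.Putts": (-0.944199, 0.058233),
    "SG.OTT": (-0.981893, 0.044361),
    "SG.APR": (-0.987245, 0.048714),
    "SG.ARG": (-0.824935, 0.091928),
}


def conf_int(estimate: float, std_error: float, t_crit: float = T_CRIT):
    """Return the (lower, upper) bounds of estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


intervals = {name: conf_int(est, se) for name, (est, se) in coefficients.items()}

for name, (lower, upper) in intervals.items():
    print(f"{name:18s} [{lower:.3f}, {upper:.3f}]")
```

Every interval lies entirely below zero, which agrees with the very small p-values for these slopes.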



Model Diagnostics

There don’t seem to be any discernible patterns in the residuals, and the constant variance assumption appears to be satisfied.

Normality of error terms is also satisfied.


There does seem to be a possible interaction between rounds played and whether a player won at least one tournament that year. For players who are winning tournaments, playing more rounds doesn’t seem to have a big effect on average score, which makes sense because they are already playing well enough to win. Players who aren’t winning, however, seem to play better the more they play, suggesting a negative relationship between rounds played and average score for that group.


Full Modeling Equation With Interaction Term:

\[ \mathrm{Average.Score_i} = \beta_0 + \beta_1 \mathrm{Rounds_i} + \beta_5 \mathrm{Avg.Scrambling_i} + \beta_6 \mathrm{Avg.SG.Putts_i} \\ + \beta_7 \mathrm{SG.OTT_i} + \beta_8 \mathrm{SG.APR_i} + \beta_9 \mathrm{SG.ARG_i} + \beta_{12} \mathrm{Won_i} + \beta_{13} (\mathrm{Rounds_i \times Won_i}) + e_i \\ \hspace{1cm} {e}_i \sim i.i.d. \hspace{1mm} \mathcal{N} (0,\,\sigma^{2})\,. \]
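
One way to read the interaction term is to write the model out separately for each group. Here Won is 0 for players who did not win and 1 for players who did, as in the variable descriptions. The remaining terms (scrambling and the strokes gained variables) are collected into a placeholder called other, a shorthand introduced only for this illustration:

\[ \text{Did not win } (\mathrm{Won_i} = 0): \quad \mathrm{Average.Score_i} = \beta_0 + \beta_1 \mathrm{Rounds_i} + \mathrm{other_i} + e_i \\ \text{Won } (\mathrm{Won_i} = 1): \quad \mathrm{Average.Score_i} = (\beta_0 + \beta_{12}) + (\beta_1 + \beta_{13}) \mathrm{Rounds_i} + \mathrm{other_i} + e_i \]

So \(\beta_{13}\) is the difference in the Rounds slope between players who won and players who did not, and testing \(\beta_{13} = 0\) asks whether the two groups share a common slope.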

Incremental F-Test to test significance of interaction:

\[ \mathrm{H_O:} \; \beta_{13} = 0 \\ vs \\ \mathrm{H_A:} \beta_{13} \neq 0 \\ \]

\[ \mathrm{FS} = \frac{(\mathrm{RegSS_F} - \mathrm{RegSS_N}) / q}{\mathrm{RSS_F}/(n-(p+1))} \; | \: \mathrm{H_O} \sim F_{q, \; n-(p+1)} \]

\[ \mathrm{FS} = 17.715 \]

\[ \mathrm{p \;value = 4.017e-05} \\ \]

\[ \text{Based on the tiny p value, we reject the null hypothesis and conclude} \\ \text{the interaction term is statistically significant} \]

## Analysis of Variance Table
## 
## Model 1: Average.Score ~ Rounds + Average.Scrambling + Average.SG.Putts + 
##     SG.OTT + SG.APR + SG.ARG + won_that_year
## Model 2: Average.Score ~ Rounds + Average.Scrambling + Average.SG.Putts + 
##     SG.OTT + SG.APR + SG.ARG + won_that_year + Rounds * won_that_year
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    184 7.9220                                  
## 2    183 7.2228  1   0.69919 17.715 4.017e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
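
The incremental F statistic can be reproduced by hand from the table above. The Python sketch below works with residual sums of squares instead of regression sums of squares. The two are equivalent because both models share the same total sum of squares, so RegSS_F - RegSS_N = RSS_N - RSS_F.

```python
rss_null = 7.9220   # RSS of Model 1 (no interaction)
rss_full = 7.2228   # RSS of Model 2 (with Rounds x won_that_year)
q = 1               # number of extra parameters in the full model
df_full = 183       # residual degrees of freedom of the full model

# Drop in RSS per added parameter, relative to the full model's error variance.
f_stat = ((rss_null - rss_full) / q) / (rss_full / df_full)

print(f"F on {q} and {df_full} DF ~ {f_stat:.3f}")
```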



Detection of Influential Outliers


Regression Outliers

Leverage of Outliers


We went ahead and removed points 28, 39 and 183 and fit the model again.

  • R-Squared with outliers: 0.9367
  • R-Squared without outliers: 0.936
  • RSE with outliers: 0.1987
  • RSE without outliers: 0.193

Due to the very small differences in the model, we decided not to remove the outliers.



Logistic Regression Models


After checking for collinearity and then performing variable selection via backwards AIC, we obtained the following model:


Full Modeling Equation

\[ p_i = P(\mathrm{Won} = 1 \; | \; \mathrm{Average.SG.Putts_i}, \; \mathrm{SG.OTT_i}, \; \mathrm{SG.APR_i}) \\ \]

\[ Y_i \sim \mathrm{ind} \; Bin(1, p_i) \]

\[ \log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 \; \mathrm{Average.SG.Putts_i} + \beta_2 \; \mathrm{SG.OTT_i} + \beta_3 \; \mathrm{SG.APR_i} \]

## 
## Call:
## glm(formula = won_that_year ~ Average.SG.Putts + SG.OTT + SG.APR, 
##     family = "binomial", data = log_regres_golf_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2189  -0.6195  -0.4740  -0.2493   2.7917  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.0857     0.2726  -7.650 2.01e-14 ***
## Average.SG.Putts   1.5962     0.7044   2.266  0.02344 *  
## SG.OTT             1.8500     0.6861   2.696  0.00701 ** 
## SG.APR             1.4630     0.6473   2.260  0.02382 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 179.31  on 191  degrees of freedom
## Residual deviance: 156.70  on 188  degrees of freedom
## AIC: 164.7
## 
## Number of Fisher Scoring iterations: 5
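
To show how the fitted coefficients turn into a predicted probability, here is a minimal Python sketch that applies the inverse logit to the linear predictor. The function name `win_probability` and the example player with 0.5 strokes gained in each category are hypothetical, chosen only to illustrate the calculation.

```python
import math

# Coefficients copied from the glm summary above.
INTERCEPT = -2.0857
B_PUTTS = 1.5962
B_OTT = 1.8500
B_APR = 1.4630


def win_probability(sg_putts: float, sg_ott: float, sg_apr: float) -> float:
    """Predicted probability of winning at least one tournament."""
    # Linear predictor on the log-odds scale.
    eta = INTERCEPT + B_PUTTS * sg_putts + B_OTT * sg_ott + B_APR * sg_apr
    # Inverse logit maps log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-eta))


# A player exactly at the baseline (zero strokes gained) in all three categories.
baseline = win_probability(0.0, 0.0, 0.0)

# Hypothetical player gaining half a stroke per round in each category.
strong = win_probability(0.5, 0.5, 0.5)

print(f"baseline player: {baseline:.3f}")
print(f"hypothetical strong player: {strong:.3f}")
```

With a 0.5 classification cutoff, only players whose combined strokes gained push the log-odds above zero would be predicted as winners.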


In Sample Prediction Accuracy and Confusion Matrix


Because the data are heavily imbalanced, with far more players who didn’t win a tournament than players who did, the predictions are also skewed.



Checking for Model Significance

\[ H_O: \beta_1 = \beta_2 = \beta_3 = 0 \\ vs \\ H_A: \mathrm{at \; least \; one \;} \beta_j \neq 0 \]

\[ \mathrm{p \; value: 4.89e-05} \\ \]

\[ \text{Based on the tiny p value when comparing the full model to the null model, we can reject } \\ \text{the null hypothesis and claim the model is statistically significant} \]

## Analysis of Deviance Table
## 
## Model 1: won_that_year ~ 1
## Model 2: won_that_year ~ Average.SG.Putts + SG.OTT + SG.APR
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1       191     179.31                         
## 2       188     156.71  3   22.601 4.89e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
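
The p-value in this table is the upper tail of a chi-squared distribution with 3 degrees of freedom evaluated at the drop in deviance. The Python sketch below uses the closed form that exists for df = 3, erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2). The function name `chi2_sf_df3` is made up for this illustration.

```python
import math


def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability P(X > x) for a chi-squared variable with df = 3."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


deviance_drop = 22.601   # "Deviance" column of the table above
df = 191 - 188           # difference in residual degrees of freedom

p_value = chi2_sf_df3(deviance_drop)
print(f"df = {df}, p-value = {p_value:.3e}")
```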



Slope Interpretations:
\[ \textbf{Probability: } \text{As average strokes gained putting increases, holding strokes gained off the tee} \\ \text{and strokes gained approaching the green constant, the probability of winning also increases.} \]

\[ \textbf{Log-odds: }\text{Per stroke increase in average strokes gained putting, holding strokes gained off the tee } \\ \text{and strokes gained approaching the green constant, the log-odds of winning increase by 1.596} \]

\[ \textbf{Odds: } \text{Per stroke increase in average strokes gained putting, holding strokes gained off the tee} \\ \text{and strokes gained approaching the green constant, the odds of winning multiply by 4.93} \]

\[ \textbf{Odds: } \text{Per stroke increase in strokes gained off the tee, holding average strokes gained putting } \\ \text{and strokes gained approaching the green constant, the odds of winning multiply by 6.36.} \]

\[ \textbf{Odds: } \text{Per stroke increase in strokes gained approaching the green, holding average strokes gained putting } \\ \text{and strokes gained off the tee constant, the odds of winning multiply by 4.32.} \]
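
The odds multipliers are obtained by exponentiating the log-odds coefficients. A short Python sketch using the estimates as printed in the glm summary:

```python
import math

# Log-odds coefficients copied from the glm summary.
log_odds = {
    "Average.SG.Putts": 1.5962,
    "SG.OTT": 1.8500,
    "SG.APR": 1.4630,
}

# exp(coefficient) is the multiplicative change in the odds of winning
# for a one-stroke increase, holding the other predictors constant.
odds_ratios = {name: math.exp(beta) for name, beta in log_odds.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:18s} odds ratio ~ {ratio:.2f}")
```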



Confidence Interval Interpretations:

\[ \textbf{ Average SG Putting: } \text {With 95% confidence, we expect that for a one stroke increase in average strokes gained putting, holding strokes gained}\\ \text{off the tee and strokes gained approaching the green constant, the odds of winning at least one PGA tournament multiply by between 1.28 and 20.58. } \]
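
For comparison, a simple Wald interval for the same odds ratio can be computed from the estimate and standard error in the glm summary. This Python sketch exponentiates estimate ± z × standard error. Its endpoints differ somewhat from the interval reported above, which may have been computed by a different method (for example, profile likelihood). That is a guess, not something the report states.

```python
import math
from statistics import NormalDist

estimate = 1.5962    # Average.SG.Putts coefficient (log-odds scale)
std_error = 0.7044   # its standard error

# 97.5th percentile of the standard normal, about 1.96.
z_crit = NormalDist().inv_cdf(0.975)

# Build the interval on the log-odds scale, then exponentiate to the odds scale.
lower = math.exp(estimate - z_crit * std_error)
upper = math.exp(estimate + z_crit * std_error)

print(f"Wald 95% interval for the odds ratio: [{lower:.2f}, {upper:.2f}]")
```

Both approaches give an interval that excludes 1, consistent with the coefficient being significant at the 0.05 level.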